Focused web crawling system and method thereof

ABSTRACT

The present invention relates to a system for focused web crawling comprising a crawler, a distiller, a queuing unit and a classifying module arranged to carry out a method for focused web crawling that inputs a seed address into a subsequently formed address queue, iteratively extracts a primary address from the address queue, iteratively invigilates the primary address for presence in an address store, and follows a series of steps to conduct a relevancy check of the addresses via the naïve Bayes protocol, simultaneously calculates a primary conditional probability of a set of predefined webpage(s) using the protocol, sequentially calculates a plurality of secondary conditional probabilities pertaining to the webpage(s) of the iteratively extracted primary addresses, further classifies the webpage(s) as relevant/irrelevant webpage(s) and finally transfers the addresses of the relevant webpage(s) and the relevant set of addresses into the address queue, else into the address store.

STATEMENT REGARDING DEPARTMENT OF SCIENCE AND TECHNOLOGY (DST) SPONSORED RESEARCH PROJECT

The invention was made with Government of India support and is funded by the Department of Science and Technology, Government of India. The end product, in the form of a Management Information System (MIS), shall showcase the achievements of Indian scientists and academicians working abroad, highlighting the achievements of Indian women scientists and academicians. The resulting database shall also be useful to the scientific community and other stakeholders in forging research and academic collaborations, policy planning, etc.

FIELD OF THE INVENTION

The present invention relates to the field of web crawler systems. More specifically, the present invention relates to a focused web crawling system and method that ensures fetching of user-desired content only, through efficient scrutiny of webpages and the addresses linked thereto.

BACKGROUND OF THE INVENTION

Background description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.

To effectively use the vast amount of information available online, the webpages that comprise the World Wide Web (WWW) need to be classified. Webpage classification is essential not only to satisfy the knowledge growth of academics but also to provide quick and efficient information-analysis solutions for industry. According to a survey conducted by Global Growth Markets (GGM) and commissioned by Elsevier (National Diet Library 2011), nine out of ten doctors in Asia-Pacific rely on online search engines to aid clinical decisions.

CN106294364B discloses a method and device for enabling a web crawler to capture web pages. In the method, web pages belonging to different websites are divided in advance into different web page clusters, and/or web pages belonging to different products on the same website are divided into different web page clusters. The method comprises the following steps: for any webpage cluster, computing a minimum confidence interval of the sleep time of the webpage cluster during capturing, such that the capturing success rate of the webpage cluster meets a preset confidence level; configuring the sleep time of the webpage cluster during capturing within the minimum confidence interval range; and informing the web crawler of the configured sleep time so that the web crawler captures the web pages in the web page cluster according to the configured sleep time. Through the method and device, the prior-art problem that the capturing success rate and the capturing efficiency cannot be effectively guaranteed simultaneously, when web pages in different websites or web pages of different products on the same website are captured, can be solved. The embodiment of the application also discloses a device for enabling a web crawler to capture web pages.

Moreover, Google, a general-purpose search engine, is the most popular and non-evidence-based search engine used by doctors. Moreover, by 2022, medical knowledge will double in volume every 73 days. The biggest disadvantage of such a general-purpose search engine is that the domain from which information is to be fetched is very large. Hence the collection of webpages it returns contains a lot of irrelevant data as well.

Focused web crawling systems and methods can be a solution to such problems, wherein the focused crawler decides which URLs to explore to reach webpages of interest. Deciding the relevancy of a webpage in accordance with a topic of interest can be considered a supervised learning problem. Binary classifiers are used to decide whether a webpage is relevant or irrelevant according to the topic of user interest. Therefore, a set of pre-downloaded webpages can be used as a training example set for the classifier to make future decisions easier.

Hence, there is a need to envision a focused crawling system and method that can search webpages by topic and can index the webpages of interest instead of gathering all the webpages, thereby acquiring only user-desired webpages adaptively, accurately and speedily.

OBJECTS OF THE INVENTION

The principal object of the present invention is to overcome the disadvantages of the prior art.

An object of the present invention is to provide a web crawling system and method that adaptively acquires only relevant pages of interest, thus executing a focused web crawl.

Another object of the present invention is to provide a web crawling system and method that efficiently implements one or more advanced filtering techniques onto web addresses, thereby retaining only relevant ones.

Another object of the present invention is to provide a web crawling system and method involving an advanced probabilistic classification criterion for scrutinizing webpages and addresses thereof.

Yet another object of the present invention is to provide a web crawling system and method that makes efficient use of a trie data structure to conduct keyword matching of webpages and verification of addresses linked thereto, thus facilitating fast web crawling.

The foregoing and other objects, features, and advantages of the present invention will become readily apparent upon further review of the following detailed description of the preferred embodiment as illustrated in the accompanying drawings.

SUMMARY OF THE INVENTION

The present invention relates to a web crawling system and method that facilitates user-defined scrutiny of webpages and the addresses of the webpages through execution of multiple filtering techniques and an advanced probabilistic classification criterion, thereby delivering user-desired web results.

According to an embodiment of the present invention, the system for focused web crawling comprises a crawler configured to extract addresses of a plurality of webpages similar to a receivable at least one seed address, a distiller configured to sequentially refine the addresses using a plurality of filtration techniques and the naïve Bayes protocol, thereby transferring a first set of relevant addresses for being iteratively passed onto the crawler, and a classifying module operable to categorize the plurality of webpages for relevancy via the protocol, thereafter deriving a second set of relevant addresses for being iteratively passed onto the distiller.

According to another embodiment of the present invention, the crawler retrieves the plurality of webpages as well as the addresses associated with the plurality of webpages, the system further comprises a queuing unit for maintaining at least one list of the addresses extracted from the crawler, the queuing unit strategically updates the list based on sequential inputs received from the crawler, the classifying module formulates an intelligence matrix from a training unit configured to conceptualize the relevancy, and the training unit further includes a keyword extraction unit containing a list of keywords to be analyzed therein for the conceptualization.

According to another embodiment of the present invention, the front address from the first set of relevant addresses is passed onto the crawler. According to another embodiment of the present invention, the crawler also maintains a crawling history that includes, but is not limited to, the crawled part of the plurality of webpages, the time taken to download a file, and the number of iterations. According to another embodiment of the present invention, the plurality of filtration techniques are selected to be, but not limited to, checking the top-level domain of the addresses, checking for no out-of-domain address, checking duplicity in already processed addresses, checking duplicity in yet-to-be-processed addresses, and discarding addresses based on irrelevant keywords.

According to another embodiment of the present invention, the training unit conducts procedures including, but not restricted to, stopword elimination, stemming, generation of a set of features based on occurrence frequency, and implementation of the naïve Bayes protocol. According to another embodiment of the present invention, the training unit further shortlists the set of features using approaches including, but not limited to, the document frequency approach, the information gain approach, the chi-square statistics approach, and the term strength approach. According to another embodiment of the present invention, the classifying module categorizes the plurality of webpages by comparing against the intelligence matrix.

The present invention also relates to a method to implement focused web crawling, comprising the steps of inputting a seed address into a subsequently formed address queue; iteratively extracting a primary address from the address queue; iteratively invigilating the primary address for presence in an address store, wherein if not present, extracting a set of secondary addresses from the webpage of the primary address; applying a plurality of filtering techniques as a passing criterion on the set of secondary addresses, wherein if passed, verifying the set of secondary addresses for the presence of a set of predefined keywords; upon successful verification, classifying the set of secondary addresses for relevancy via the naïve Bayes protocol; transferring the relevant set of secondary addresses into the address queue, else into the address store; simultaneously calculating a primary conditional probability of a set of predefined webpage(s) using the protocol; sequentially calculating a plurality of secondary conditional probabilities pertaining to the webpage(s) of the iteratively extracted primary addresses; classifying the webpage(s) having a secondary conditional probability higher than the primary conditional probability as relevant webpage(s), else irrelevant webpage(s); and transferring the addresses of the relevant webpage(s) into the address queue, else into the address store.
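A minimal sketch of this control flow, in Python, can make the queue/store interplay concrete. The helper callables `extract_links`, `passes_filters` and `classify_relevant` are hypothetical stand-ins for the crawler's link extraction, the distiller's filtration techniques and the naïve Bayes relevancy decision; this illustrates the recited steps, not the authors' implementation.

```python
from collections import deque

def focused_crawl(seed_address, extract_links, passes_filters, classify_relevant):
    """Illustrative control flow of the claimed method (not the authors' code).

    extract_links(addr)     -> iterable of secondary addresses found on the webpage
    passes_filters(addr)    -> bool: the distiller's filtration techniques
    classify_relevant(addr) -> bool: naive Bayes relevancy decision
    """
    address_queue = deque([seed_address])  # addresses waiting to be crawled (FIFO)
    address_store = set()                  # addresses already processed or rejected

    while address_queue:
        primary = address_queue.popleft()  # extract the front (primary) address
        if primary in address_store:       # invigilate for presence in the store
            continue
        address_store.add(primary)
        for secondary in extract_links(primary):
            if passes_filters(secondary) and classify_relevant(secondary):
                address_queue.append(secondary)  # relevant: queue for crawling
            else:
                address_store.add(secondary)     # irrelevant: park in the store
```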

According to an embodiment of the present invention, the primary address is preferably the front address in the address queue. According to another embodiment of the present invention, the plurality of filtration techniques are selected to be, but not limited to, checking the top-level domain of the addresses, checking for no out-of-domain address, checking duplicity in already processed addresses, checking duplicity in yet-to-be-processed addresses, and discarding addresses based on irrelevant keywords.

According to another embodiment of the present invention, calculation of the secondary conditional probability is supported by a plurality of preliminary procedures including, but not restricted to, stopword elimination, stemming, and generation of a set of features based on occurrence frequency. According to another embodiment of the present invention, classification of the relevant set of secondary addresses and of the addresses of the relevant webpage(s) occurs concurrently.

While the invention has been described and shown with particular reference to the preferred embodiment, it will be apparent that variations might be possible that would fall within the scope of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings are included to provide a further understanding of the present disclosure and are incorporated in and constitute a part of this specification. The drawings illustrate exemplary embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.

In the figures, similar components and/or features may have the same reference label. Further, various components of the same type may be distinguished by following the reference label with a second label that distinguishes among the similar components. If only the first reference label is used in the specification, the description is applicable to any of the similar components having the same reference label irrespective of the second reference label.

FIG. 1 illustrates a schematic diagram of a focused web crawling system, according to an embodiment;

FIG. 2 illustrates a block diagram of a training unit configured to formulate an intelligence matrix used as a reference for assessing relevancy of webpages, according to an embodiment;

FIG. 3 shows a flow chart of a sequence of steps carried out in a classifying module, according to an embodiment;

FIG. 4 illustrates a flow chart of a focused web crawling method executed to classify webpages and addresses of webpages, according to an embodiment;

FIG. 5 illustrates a flow chart of a concurrently executed second part of the focused web crawling method for classification of webpages, according to an embodiment;

FIG. 6A shows an experimental performance chart of precision versus true positives for the disclosed naïve Bayes protocol in comparison to other state-of-the-art protocols, according to an embodiment; and

FIG. 6B shows an experimental performance chart of harvest ratio versus retrieved webpages for the disclosed naïve Bayes protocol in comparison to other state-of-the-art protocols, according to an embodiment.

DETAILED DESCRIPTION OF THE INVENTION

As used in the description herein and throughout the claims that follow, the meaning of "a," "an," and "the" includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein, the meaning of "in" includes "in" and "on" unless the context clearly dictates otherwise.

If the specification states a component or feature "may", "can", "could", or "might" be included or have a characteristic, that particular component or feature is not required to be included or have the characteristic.

Exemplary embodiments will now be described more fully hereinafter with reference to the accompanying drawings, in which exemplary embodiments are shown. This disclosure may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. These embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the disclosure to those of ordinary skill in the art. Moreover, all statements herein reciting embodiments of the disclosure, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future (i.e., any elements developed that perform the same function, regardless of structure).

Various terms as used herein are shown below. To the extent a term used in a claim is not defined below, it should be given the broadest definition persons in the pertinent art have given that term as reflected in printed publications and issued patents at the time of filing.

In some embodiments, the numerical parameters set forth in the written description and attached claims are approximations that can vary depending upon the desired properties sought to be obtained by a particular embodiment. In some embodiments, the numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments of the invention are approximations, the numerical values set forth in the specific examples are reported as precisely as practicable. The numerical values presented in some embodiments of the invention may contain certain errors necessarily resulting from the standard deviation found in their respective testing measurements.

The present invention relates to a focused web crawling system and method that employs naïve Bayes protocol based classification, supplemented with advanced filtering techniques, to adaptively and accurately acquire only user-desired web crawling results.

Referring to FIG. 1, a schematic diagram of the disclosed system architecture is shown, which comprises a crawler, a distiller, a classifying module, a queuing unit, a training unit and a keyword extraction unit. The seed address is the starting address for the iterations performed by the crawler. The World Wide Web is a collection of an unlimited number of webpages of different domains. A plurality of webpages can be retrieved, and addresses thereof can be extracted from those retrieved webpages.

The queueing unit is used to store and thus maintain at least one list of addresses found by the crawler, wherein the unit strategically updates the list based on sequential inputs received from the crawler. The distiller refines the extracted addresses using a plurality of filtration techniques and the naïve Bayes protocol, thereby transferring a first set of relevant addresses for being iteratively passed onto said crawler. The distiller makes a probabilistic decision about which addresses need to be explored further to reach relevant webpages. These filtered addresses are repeatedly given as input to the crawler to further explore the webpages. Once a webpage is extracted, the naïve Bayes protocol conducts a relevancy check to classify it as relevant or irrelevant, thereafter deriving a second set of relevant addresses for being iteratively passed onto the distiller. Whilst both the distiller and the classifier adopt the naïve Bayes protocol, the former classifies the addresses only and the latter classifies the webpages only.

The crawler receives seed addresses as input to reach the target webpages. The queuing unit stores all the addresses extracted by the crawler. These extracted addresses are given as input to the distiller to refine the addresses that need to be explored further. During each iteration, preferably the front address from the filtered addresses' queueing unit can be removed and passed to the crawler. The addresses in the queueing unit simply follow a first-in-first-out rule. Duplicate addresses are not added to the address queue, as this is prevented by the distiller.

Moreover, the crawler sends a hypertext transfer protocol (HTTP) request and reads the response; the hypertext transfer protocol client also sets a timeout to handle non-responsive web servers. During the implementation, the robot exclusion protocol is also observed, to follow the server-provided access policies. The crawler also maintains a crawling history that includes, but is not limited to, the crawled part of the plurality of webpages, the time taken to download a file, and the number of iterations. This is done to gain insight into the process for future improvement of crawling.
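The specification does not name an implementation language, but the fetch behaviour described here (HTTP request, timeout for non-responsive servers, robot exclusion protocol) can be sketched with Python's standard library; the function below is illustrative only.

```python
import urllib.request
import urllib.robotparser
from urllib.parse import urlparse, urljoin

def fetch(url, timeout=10):
    """Fetch a webpage while honouring robots.txt and a response timeout."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    robots = urllib.robotparser.RobotFileParser(urljoin(root, "/robots.txt"))
    robots.read()                          # robot exclusion protocol
    if not robots.can_fetch("*", url):
        return None                        # server policy disallows this path
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()             # timeout guards non-responsive servers
```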

According to an embodiment of the present invention, the distiller is employed to discard highly irrelevant addresses through a plurality of filtering techniques as well as the naïve Bayes protocol. Herein, a "highly irrelevant" address means an address that can move the crawler into a section of a website that will increase the overall distance from the current webpage to a relevant webpage. Firstly, the filtration techniques can be applied to the plurality of addresses to identify highly irrelevant addresses, and thereafter the naïve Bayes protocol can be used to filter them further. The filtration techniques employed by the distiller are selected from, but not limited to, checking the top-level domain of the addresses, checking for no out-of-domain address, checking duplicity in already processed addresses, checking duplicity in yet-to-be-processed addresses, and discarding addresses based on irrelevant keywords. A detailed explanation of the filtering techniques is provided below:

a) Checking the top-level domain of addresses: To avoid out-of-domain addresses, the Top Level Domain (TLD) of each address is checked and matched with the TLD of the seed address. As an example, the seed address www.stanford.edu has "edu" as its TLD. So any address whose TLD is different from "edu" will be discarded, thus helping save a lot of resources.

b) No out-of-domain address: There may be two or more universities having ".edu" as TLD but representing completely different websites. Even if the crawler is allowed to follow only addresses of the same TLD, crawling can still be an exhaustive process. So, after applying the TLD check, this domain filter will keep the crawler in the same domain as that of the seed address.

c) Checking duplicity in already processed addresses: Even after checking the TLD and restricting the crawler to the domain of the seed address, there can be multiple paths leading to a single webpage on a website. So a check is kept by the distiller to find whether an address has already been processed or not.

A trie data structure is maintained by the distiller for this filter. A trie is used because it is very efficient for searching strings having common prefixes. The trie can also be used for matching strings in the dictionary, which is used specifically in our project implementation. The frequently used hash table is avoided here for two reasons: first, a hash table cannot be used for prefix-based search and, second, it takes more space than a trie data structure.

The use of a trie data structure provides fast searching of an address. Using a trie, the search complexity can be brought to an optimal limit, namely the length of the address to be searched. Mathematically, if N is the number of addresses present in the trie and M is the length of the address to be searched, then the complexity of searching is O(M), rather than O(N*M).

d) Checking duplicity in yet-to-be-processed addresses: This covers the case in which an address is found and saved in the address queue, waiting to be processed. The crawler may find multiple instances of the same address and add all the instances to the address queue. So, before adding any address to the address queue, another trie is maintained that contains all the addresses which were once added to the address queue. This ultimately avoids reprocessing and storing multiple instances in the address queue, which could lead to memory overflow for large websites.

e) Discarding addresses based on irrelevant keywords: The addresses are further checked for certain keywords that are used to mark an address as irrelevant. The presence of these keywords is checked in the address text itself and also in the anchor text of the address on the webpage. Depending upon the topic of interest of the crawler, addresses are discarded. (A minimal sketch of these filters and the trie appears below.)
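The following sketch, under the assumption of a Python implementation, illustrates filters (a), (b) and (e) together with the trie used for the duplicity checks (c) and (d); all names are illustrative, and the keyword list echoes the example set given later in this specification.

```python
from urllib.parse import urlparse

class Trie:
    """Character trie over address strings; searching an address of length M
    costs O(M) regardless of how many addresses N are stored."""
    def __init__(self):
        self.root = {}

    def insert(self, address):
        node = self.root
        for ch in address:
            node = node.setdefault(ch, {})
        node["$"] = True                   # end-of-address marker

    def contains(self, address):
        node = self.root
        for ch in address:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

def same_domain(address, seed):
    """Filters (a) and (b): match the TLD, then stay inside the seed's domain."""
    a, s = urlparse(address), urlparse(seed)
    same_tld = a.netloc.rsplit(".", 1)[-1] == s.netloc.rsplit(".", 1)[-1]
    return same_tld and a.netloc == s.netloc

IRRELEVANT_KEYWORDS = {"news", "events", "syllabus", "timetable",
                       "scholarship", "download"}

def keyword_ok(address_text):
    """Filter (e): discard addresses whose text carries off-topic keywords."""
    return not any(word in address_text.lower() for word in IRRELEVANT_KEYWORDS)
```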

The working of the classifier and the distiller can be divided into two main steps: training and classification. In the training phase, a training unit is created by a domain expert that consists of webpages classified as relevant or irrelevant, and an Intelligence Matrix is derived that defines the features of on-topic webpages.

As indicated in FIG. 2, a training set for the classifier is shown, configured to formulate an intelligence matrix used as a reference for assessing relevancy of webpages. The webpages in the training repository are preprocessed to eliminate stopwords, for example a, an, for, the, etc., followed by a stemming procedure. In stemming, the root word of each stemmed word is determined. Then a keyword set is used to create a feature set based on the frequency of occurrence of these keywords in the webpages and their position on the webpage. In the end, the naïve Bayes protocol calculates the probability of a webpage belonging to a user-desired topic, thus producing a feature set with corresponding probabilities that can be used for classification. The output of the training unit is an Intelligence Matrix that is used by the classifying unit.
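A simplified Python sketch of this preprocessing follows; the stopword list is deliberately tiny, and `crude_stem` is a rough suffix-stripper standing in for a real stemming procedure, so both are illustrative assumptions rather than the training unit's actual components.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}  # tiny sample

def crude_stem(word):
    """Rough suffix stripping standing in for a real stemming procedure."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def feature_vector(page_text, keyword_set):
    """Stopword elimination, stemming, then keyword occurrence-frequency features."""
    tokens = re.findall(r"[a-z]+", page_text.lower())
    stems = Counter(crude_stem(t) for t in tokens if t not in STOPWORDS)
    return [stems[crude_stem(k)] for k in keyword_set]
```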

Referring to FIG. 3, a flow chart of a sequence of steps carried out in a classifying module is presented. Herein, stemming and stop-word elimination are performed on the webpage text. The input to this trained classifier is an Intelligence Matrix for the webpage whose fate is to be decided. To calculate the required matrix, firstly, the number of occurrences of each word in the document can be counted. The Intelligence Matrix is given as input to the trained classifier, which further estimates the probability of webpages being relevant or irrelevant, thereafter deriving a second set of relevant addresses for being iteratively passed onto said distiller. Similar training and classification phases are carried out by the distiller, which classifies the addresses as relevant or irrelevant.

According to an embodiment of the present invention, the time taken by the algorithm to decide the relevancy of a webpage depends highly on the feature space. Hence, the training unit further shortlists the set of features using approaches including, but not limited to, the document frequency approach, the information gain approach, the chi-square statistics approach, and the term strength approach.

In an aspect, the Document Frequency (DF) approach has been found to be the most suitable and was thus chosen for the study. The DF approach uses the assumption that terms which occur rarely on a webpage are either non-informative for relevance prediction or do not influence the global performance of the classifier. So, both cases support removal of rare terms to reduce the dimensionality of the feature space. DF is a simple and scalable technique. For each unique term in the training unit, the DF is computed, and those terms are removed from the feature space whose document frequency is less than a predefined threshold.
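A short sketch of DF-based pruning, assuming each training webpage has already been reduced to a set of terms (illustrative Python, not the authors' code):

```python
from collections import Counter

def df_prune(documents, threshold):
    """Keep only terms whose document frequency reaches the predefined threshold.

    documents: one set of terms per training webpage.
    """
    df = Counter()
    for terms in documents:
        df.update(set(terms))              # count each term once per document
    return {term for term, count in df.items() if count >= threshold}
```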

The Basic Naive Bayes protocol is given by:

$$P(C \mid X) = \frac{P(X \mid C) \cdot P(C)}{P(X)} \qquad (1)$$

wherein C_i is the class into which webpages are to be classified, i ∈ {0,1}, where 0 denotes the relevant set of sample data and 1 denotes the irrelevant set, and X is the feature set of one sample of data. P(C|X) denotes the conditional probability of class C for the feature set X of a webpage. Also, P(X) can be ignored because it remains constant for one X, ∀C. So, the above equation (1) can be rewritten as:

$$P(C \mid X) \propto P(X \mid C) \cdot P(C)$$

P(C) is calculated based on the training webpage dataset, as the number of training webpages belonging to class C divided by the total number in the training data set; and, for n selected features,

$$P(X \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdot P(X_3 \mid C) \cdots P(X_n \mid C) \qquad (2)$$

In general, equation (2) becomes

$$P(X \mid C) = \prod_{i=1}^{n} P(X_i \mid C)$$

Thus, the probability that a feature set belongs to a relevant webpage is decided as:

$$P(X_j \mid C) = \frac{1}{\sqrt{2\pi\sigma_{x,c}^{2}}}\, e^{-\frac{1}{2}\left[\frac{(x_j - \mu_{x,c})^{2}}{\sigma_{x,c}^{2}}\right]}$$

$$\text{where } \mu_{x,c} = \frac{1}{n_1}\sum x_j \quad (\forall j : y_j = c)$$

$$\sigma_{x,c}^{2} = \frac{1}{n_1}\sum \left(x_j - \mu_{x,c}\right)^{2} \quad (\forall j : y_j = c)$$

where n₁ is the number of instances of class c in y.
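The three formulas above translate directly into code; the following Python sketch computes the class-conditional mean, variance and Gaussian likelihood exactly as written (illustrative only):

```python
import math

def gaussian_params(values):
    """Mean and variance over the n1 instances of class c (as defined above)."""
    n1 = len(values)
    mu = sum(values) / n1
    var = sum((x - mu) ** 2 for x in values) / n1
    return mu, var

def gaussian_likelihood(x_j, mu, var):
    """P(X_j | C) under the normal model of the equation above."""
    return math.exp(-0.5 * (x_j - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)
```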

Referring to FIG. 4, a flow chart of a focused web crawling method is shown that is executed to classify webpages and the addresses of the webpages. The seed address is the address that recommends the webpage of interest, which is to be explored, to the crawler. The primary address, which is basically the front address in the address queue, is extracted, assigned to a variable parent address, and its presence is searched for in the address store. If the address is present, then the same step is repeated until an address is found that has not been explored. Upon finding such an address, all the addresses on the webpage of the primary address, referred to as the set of secondary addresses, are extracted for further processing. The secondary addresses correspond to the textual data displayed on a webpage, clicking on which redirects the user to a new address or webpage. All the afore-discussed filtering techniques are applied as the next-stage passing criteria for said set of secondary addresses.

Subsequently, upon passing the aforementioned criteria, the set of secondary addresses can be verified for the presence of a set of predefined keywords, in accordance with the topic of interest. Upon successful verification, the set of secondary addresses is classified for relevancy via the naïve Bayes protocol. If found relevant, the address is added to the address queue and the whole method is repeated again. However, if the address is found to be irrelevant, it is added to the address store.

Referring to FIG. 5, concurrently, the classifying unit performs another method; a flow chart of this concurrently executed second part of the focused web crawling method, for classification of webpages, is shown. The method involves a training phase and a classification phase. In the training phase, the primary conditional probability of a set of predefined webpage(s) is calculated using the naïve Bayes protocol as provided in equation 3, where |V| is the size of the vocabulary and W_t is any one word in V, |D| is the number of training webpages, N is the number of times w appears in webpage Z_i, and the value of P(C|Z_i) = 1 when Z_i belongs to category C; otherwise, its value is zero.
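Equation 3 itself does not survive in this text. Given the symbols defined here (|V|, W_t, |D|, N and P(C|Z_i)), it is presumably the standard Laplace-smoothed multinomial naïve Bayes estimate, reproduced below as an assumption rather than a quotation:

$$P(W_t \mid C) = \frac{1 + \sum_{i=1}^{|D|} N_{it}\, P(C \mid Z_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N_{is}\, P(C \mid Z_i)} \qquad (3)$$

where N_{it} denotes the number of times word W_t appears in webpage Z_i.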

In the classification phase, the primary address at the front of the address queue is extracted and checked for presence in the address store. If not present, we proceed to the next step of classifying the webpage. This is performed by calculating a plurality of secondary conditional probabilities pertaining to the webpage(s) of said iteratively extracted primary addresses; more specifically, the conditional probability P(C_k|Z_t) that webpage Z_t belongs to category C_k, knowing webpage Z_t. In this step, N_tr is the number of times r appears on webpage Z_t, and R represents all distinct words in webpage Z_t. Finally, the addresses of the relevant webpage(s) are transferred to the address queue, else into the address store.
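The classification equation is likewise not reproduced in this text; with the symbols defined above, the standard multinomial naïve Bayes decision it appears to describe would be:

$$P(C_k \mid Z_t) \propto P(C_k) \prod_{r \in R} P(W_r \mid C_k)^{N_{tr}}$$

and the webpage is assigned to the category C_k with the higher posterior.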

To empirically study the performance of our proposed algorithm, we used our algorithm for extracting Indian-origin academicians from foreign university websites. Persons skilled in the art will appreciate that, in this case study, academicians of Indian nationality are considered, but the proposed approach can be applied to any nationality, given the proper datasets used in the methodology. The analysis was accomplished on an Intel Xeon hexacore processor E5620, clock cycle 2.40 GHz, with 20 GB RAM, running Windows Server 2012 R2 Standard.

The seed address for the crawler is taken as the address of the university from where Indian academicians are to be explored. Around 5800 seed addresses from 26 different countries have been collected. 260 websites from 26 countries (10 from each country) were analyzed manually and two common major structural patterns were observed. Firstly, each website contains various sections such as academics, academicians, campus life, events, etc. Each of these sections further contains classified subsections. These subsections further contain more classified subsections, and so on. Secondly, the target academician webpages are found under the academic section (or subsection) of the website in a classified manner. Example: the academicians of the computer science department of a university are expected to be found under the computer science department section (or subsection) of the website. Considering this structure of the website, the crawler needs to identify these sections or subsections of the website.

To keep the direction of crawling towards relevant webpages, all the afore-discussed filtering techniques are applied. Checking addresses for irrelevant keywords also reduces the chances that the crawler moves in a direction where the chances of finding academician webpages are low. After analyzing 260 websites from different countries, a list of keywords was prepared that represents the sections (or subsections) of a website that do not hold academician-relevant data. These keyword lists contain words like "news", "events", "syllabus", "timetable", "scholarship", "download", etc. For every secondary address found, the presence of the aforesaid strings in its text portion is checked, and if any such keywords are found, the secondary address and the corresponding primary address are discarded.

Upon passing all the filtering techniques, the relevancy of an address is judged by the naïve Bayes protocol based classifier. The training data used for this classification is a labeled dataset that consists of two keyword datasets:

1) Academic keyword set: It consists of those keywords that point to those sections/subsections of a university website where the chances of finding academicians are high. It contains an exhaustive list of keywords, department names, and academic sections of the university. This database consists of around 27 disciplines that were used for the training of the distiller.

2) Irrelevant keyword set: This set consists of keywords that are common on university websites but move the crawler further away from the target academicians' webpages. For example: campus map, event, syllabus, etc.

These two databases used by the distiller were updated continuously over time. Referring to FIG. 4, the relevancy is checked in the end by the naïve Bayes protocol for the set of secondary addresses, and if found relevant, an address is passed to the address queue, else added to the address store. The address store is maintained to avoid visiting the same address multiple times. The important features of the text of the secondary addresses are found using the intelligence matrix and set to a proper format for categorizing the addresses as relevant or irrelevant.

The filtered URLs are passed onto the classifying unit for deciding whether the corresponding webpage is an Indian-origin academician webpage or not. Two types of webpages have been identified:

-   A central academician webpage, which contains secondary addresses to dedicated academician webpages.
-   A dedicated academician webpage, which contains information about a single academician.

The algorithm extracts both types of webpages. The webpages are processed as separate threads by the system. After analyzing several central and dedicated academician webpages, it is observed that, with high probability, the primary and secondary address sets of most dedicated academician webpages contain the name of that particular academician. For instance, the secondary address from where the webpage of "Ram Kumar" is found will contain addresses having the words Ram Kumar. Apart from this, the dedicated academician address may also contain words like view profile, view biography, etc.

Before passing the corresponding webpage to the classifying unit, a check is applied to decide whether the string present in the secondary address represents the name of a person or not. One way of doing this is to create a huge dataset of names and then search for the existence of a string in that dataset. The other solution could be a trained machine learning system to classify the strings as a name or not a name. This approach might seem the best solution to this complicated problem. But the crawler will encounter people with new names all the time, so manually updating the datasets every time for a new name cannot be a solution.

Many existing string databases were explored, and the one found best suited to this particular problem was the English dictionary. Let us consider the name "Narendra Modi". It comprises two words, Narendra and Modi, both of which are not found in an English dictionary. Consider yet another name, "Robin Gautam". The word "Robin" can be found in the dictionary as a species of bird, but the surname "Gautam" is not present in the dictionary.

This pattern was sensed and used for deciding whether the address and the corresponding set of secondary addresses passed by the distiller actually represent a person's webpage or not. The string is split on white spaces, and the presence of each subpart in the dictionary is checked. The absence of any part from the dictionary hints that the string is a name. This method may fail at times but gives good results and avoids the need for manual addition of words as required in other methods. Further, to validate the effectiveness of this method, the names dataset provided by the Social Security Administration, US (Administration 2016), under national data, was used. Around 93% of the names were correctly identified. This proves to be a promising approach to name determination. The names it failed to identify are the ones that comprise English dictionary words.
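A minimal Python sketch of this dictionary check follows; `ENGLISH_WORDS` is an assumed word list to be loaded from any English dictionary source, and the function is illustrative of the pattern described, not the project's code.

```python
ENGLISH_WORDS = set()  # assumed: load an English dictionary word list here

def looks_like_name(text):
    """Treat a string as a person's name if any whitespace-separated part
    is absent from the English dictionary (the pattern described above)."""
    parts = text.lower().split()
    return bool(parts) and any(part not in ENGLISH_WORDS for part in parts)

# With a real word list loaded:
#   looks_like_name("Narendra Modi") -> True (neither word is in the dictionary)
#   looks_like_name("Robin Gautam")  -> True ("gautam" is absent)
```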

But the project under consideration has one more open end: it was required to find specifically the Indian academicians (although the proposed technique can be applied to academicians of any nationality). For this task, three keyword databases were used: an Indian surname database, an Indian premier institute name database and an Indian cities database, having 5800, 2000 and 275 entries respectively. Along with this, a dataset of 1700 webpages from 26 countries was used for the training purpose. The training set consists of 1700 randomly selected examples, both relevant and irrelevant, from the 26 countries whose seed addresses have been collected.

As indicated in FIG. 5, the addresses are passed from the distiller and the various filtering techniques are applied. The corresponding webpage is extracted and stopwords are eliminated. If the probability is above a threshold, the webpage is marked as relevant, i.e., as being of an Indian-origin academician, and the set of secondary addresses is extracted. Otherwise, the webpage is ignored and control is transferred to the next webpage.

Table 1(a) below shows the percentage accuracy obtained by implementing the disclosed naïve Bayes protocol based focused web crawling system and method. Accuracy is measured to inform how correctly the crawler can classify webpages of Indian-origin academicians, wherein the results are presented in the form of a confusion matrix. As the number of classes is two, i.e. relevant or irrelevant, the confusion matrix has two dimensions. Rows of the confusion matrix represent the actual classification and columns denote the classification as predicted by the present system and method.

TABLE 1(a): Accuracy (in percentage)

Protocol               Training   Testing
KNN                    100.0      89.6 ± 0.78
SVM                    92.0       91.2 ± 0.53
NB                     85.0       84.4 ± 0.70
Decision tree          87.2       87.0 ± 0.65
Naïve Bayes Protocol   94.3       92.0 ± 0.53

Furthermore, Table 1(b) below shows the average cost obtained by implementing the disclosed naïve Bayes protocol based focused web crawling system and method. The cost metric can have different definitions in other scenarios. Classification cost has been used here, wherein the cost involves giving a reward for every correct classification and penalties for misclassification of a webpage. For the cost matrix, p_ij represents the penalty for misclassifying an example of class i as class j. The total cost of the algorithm is calculated by the sum Σ confusion matrix × cost matrix. Only the average cost has been reported for the task, i.e., (total cost)/number of elements in the confusion matrix.

TABLE 1(b): Average cost

Protocol               Training   Testing
KNN                    0          0.578 ± 1.03
SVM                    0.873      0.88 ± 1.00
NB                     0.517      0.589 ± 0.69
Decision tree          0.261      0.638 ± 1.47
Naïve Bayes Protocol   0.201      0.274 ± 1.06
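As a worked illustration of the cost definition above (Σ confusion matrix × cost matrix, divided by the number of matrix elements), the following Python sketch uses a purely hypothetical 2×2 confusion matrix and cost matrix, not figures from this study:

```python
def average_cost(confusion, cost):
    """Sum of element-wise confusion x cost, divided by the number of
    elements in the confusion matrix."""
    total = sum(confusion[i][j] * cost[i][j]
                for i in range(len(confusion))
                for j in range(len(confusion[i])))
    return total / sum(len(row) for row in confusion)

# Hypothetical 2x2 example (rows = actual class, columns = predicted class):
confusion = [[90, 10],   # relevant pages predicted relevant / irrelevant
             [5, 95]]    # irrelevant pages predicted relevant / irrelevant
cost = [[0, 1],          # reward 0 for correct classification, penalty 1
        [2, 0]]          # heavier penalty p_21 for accepting irrelevant pages
print(average_cost(confusion, cost))  # (10*1 + 5*2) / 4 = 5.0
```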

For constructing the above two tables, the dataset was split into train/test sets. Cross-validation has been used, in which the data is divided into k subsamples; each subsample in turn is held out for testing while the remaining (k−1) subsamples are used for constructing the rules. The accuracy is calculated as the average over the k folds. This method is heavy on resources but gives highly accurate results.
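A sketch of this cross-validation procedure, assuming hypothetical `train_fn` and `test_fn` callables for training a classifier and scoring its accuracy on held-out data:

```python
def k_fold_accuracy(data, k, train_fn, test_fn):
    """Divide data into k subsamples; each in turn is held out while the
    remaining (k - 1) subsamples construct the rules; accuracies are averaged."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(training)
        scores.append(test_fn(model, held_out))
    return sum(scores) / k
```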

Referring to Tables 1(a) and 1(b), the measures of accuracy and cost are shown, respectively, for K Nearest Neighbours (KNN), Support Vector Machines (SVM), Naïve Bayes (NB), Decision tree and the Naïve Bayes Protocol (NBP). The standard deviation in both tables is calculated using $\sqrt{v(1-v)/N}$, where v is the measured value and N is the number of webpages in the test dataset. The KNN and SVM methods perform well in terms of accuracy. The prime reason for the success of these methods may be that the features selected for them are relevant and related to each other. NB and decision tree perform moderately for accuracy. The Naïve Bayes Protocol (NBP) gives the best performance in the case of test data. Due credit here should be given to the procedure that filters out the addresses early and makes it not only simple but efficient as well. This is also the reason that the cost is lowest for the Naïve Bayes Protocol in Table 1(b). The cost for SVM is highest, as it tries to maximize accuracy rather than minimize cost.

FIG. 6A shows an experimental performance chart of precision versus true positives for the disclosed naïve Bayes protocol in comparison to other state-of-the-art protocols. Further, the proposed crawler was tested on the open web to get the webpages of Indian-origin academicians from foreign university websites by varying the True Positives (TP) in the training set. After training with varying TPs, seed addresses from academic websites of different countries were provided. As the number of true positives is increased, the performance of the plurality of compared methods also improves. NBP performed best in this scenario, and SVM and NB also performed well.

Referring to FIG. 6B, an experimental performance chart of harvest ratio versus retrieved webpages for the disclosed naïve Bayes protocol in comparison to other state-of-the-art protocols is presented. Harvest ratio is the rate at which relevant webpages are acquired and irrelevant webpages are filtered out of crawling. The present invention, which includes the naïve Bayes protocol, outperforms in each case, as the irrelevant addresses are filtered out at the initial stage. Further, the distance to relevant webpages is reduced with each step, and the chance of the crawler getting skewed is very low.

It should be apparent to those skilled in the art that many more modifications besides those already described are possible without departing from the inventive concepts herein. The inventive subject matter, therefore, is not to be restricted except in the spirit of the appended claims. Moreover, in interpreting both the specification and the claims, all terms should be interpreted in the broadest possible manner consistent with the context. In particular, the terms "includes" and "including" should be interpreted as referring to elements, components, or steps in a non-exclusive manner, indicating that the referenced elements, components, or steps may be present, or utilized, or combined with other elements, components, or steps that are not expressly referenced. Where the specification or claims refer to at least one of something selected from the group consisting of A, B, C . . . and N, the text should be interpreted as requiring only one element from the group, not A plus N, or B plus N, etc. The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the appended claims.

While embodiments of the present disclosure have been illustrated and described, it will be clear that the disclosure is not limited to these embodiments only. Numerous modifications, changes, variations, substitutions, and equivalents will be apparent to those skilled in the art, without departing from the spirit and scope of the disclosure, as described in the claims.

I/We claim:

1) A system for focused web crawling, comprising: a crawler configured to extract addresses of plurality of webpages similar to a receivable at least one seed address; a distiller configured to sequentially refine said addresses using plurality of filtration techniques and naïve bayes protocol, thereby transferring a first set of relevant addresses for being iteratively passed onto said crawler; and a classifying module operable to categorize said plurality of webpages for relevancy via said protocol, thereafter deriving a second set of relevant addresses for being iteratively passed onto said distiller, wherein said crawler retrieves said plurality of webpages as well as addresses associated with said plurality of webpages; wherein said system further comprises of a queuing unit for maintaining at least one list of said addresses extracted from said crawler; wherein said queuing unit strategically updates said list based on sequential inputs received from said crawler; wherein said classifying module formulates an intelligence matrix from a training unit configured to conceptualize said relevancy; and wherein said training unit further includes a keyword extraction unit containing a list of keywords to be analyzed therein for said conceptualization.

2) The focused web crawling system as claimed in claim 1, wherein front address from said first set of relevant addresses is passed onto said crawler.

3) The focused web crawling system as claimed in claim 1, wherein said crawler also maintains a crawling history that includes but not limited to crawled part of said plurality of webpage, time taken to download a file, number of said iterations.

4) The focused web crawling system as claimed in claim 1, wherein said plurality of filtration techniques are selected to be but not limited to checking top level domain of said addresses, checking no out of domain address, checking duplicity in already processed addresses, checking duplicity in yet to be processed addresses, discarding addresses based on irrelevant keywords.

5) The focused web crawling system as claimed in claim 1, wherein said training unit conducts procedures including but not restricted to stopword elimination, stemming, generation of set of features based on occurrence frequency, implementation of said naïve bayes protocol.

6) The focused web crawling system as claimed in claim 5, wherein said training unit further shortlists said set of features using approaches including but not limited to document frequency approach, information gain approach, chi-square statistics approach, term strength approach.

7) The focused web crawling system as claimed in claim 1, wherein said classifying module categorizes said plurality of webpages by comparing said intelligence matrix.
8) A method to implement focused web crawling, comprising steps of: inputting a seed address into a subsequently formed address queue; iteratively extracting a primary address from said address queue; iteratively invigilating said primary address for presence in an address store; if not present, extracting set of secondary addresses from webpage of said primary address; applying plurality of filtering techniques as a passing criteria on said set of secondary addresses; if passed, verifying said set of secondary addresses for presence of a set of predefined keywords; upon successful verification, classifying said set of secondary addresses for relevancy via naive bayes protocol; transferring relevant set of secondary addresses into said address queue, else into said address store; simultaneously calculating primary conditional probability of a set of predefined webpage(s) using said protocol; sequentially calculating plurality of secondary conditional probabilities pertaining to said webpage(s) of said iteratively extracted primary addresses; classifying said webpage(s) having said secondary conditional probability higher than said primary conditional probability as relevant webpage(s), else irrelevant webpage(s); and transferring addresses of said relevant webpage(s) into said address queue, else into said address store.

9) The method to implement focused web crawling as claimed in claim 8, wherein said primary address is preferably front address in said address queue.

10) The method to implement focused web crawling as claimed in claim 8, wherein said plurality of filtration techniques are selected to be but not limited to checking top level domain of said addresses, checking no out of domain address, checking duplicity in already processed addresses, checking duplicity in yet to be processed addresses, discarding addresses based on irrelevant keywords.

11) The method to implement focused web crawling as claimed in claim 8, wherein calculation of said secondary conditional probability is supported by plurality of preliminary procedures including but not restricted to stopword elimination, stemming, generation of set of features based on occurrence frequency.

12) The method to implement focused web crawling as claimed in claim 8, wherein classification of said relevant set of secondary addresses and addresses of said relevant webpage(s) occurs concurrently.